Applying Machine Learning Techniques-Regression

Homepage: https://github.com/tien-le/kaggle-titanic

Updating later ...


In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

import seaborn as sns
import random

Load Corpus After Preprocessing ...


In [2]:
#Training Corpus
trn_corpus_after_preprocessing = pd.read_csv("output/trn_corpus_after_preprocessing.csv")

#Testing Corpus
tst_corpus_after_preprocessing = pd.read_csv("output/tst_corpus_after_preprocessing.csv")

In [3]:
#tst_corpus_after_preprocessing[tst_corpus_after_preprocessing["Fare"].isnull()]

In [4]:
trn_corpus_after_preprocessing.info()
print("-"*36)
tst_corpus_after_preprocessing.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 890 entries, 0 to 889
Data columns (total 13 columns):
PassengerId          890 non-null int64
Male                 890 non-null int64
Pclass               890 non-null int64
Fare                 890 non-null float64
FarePerPerson        890 non-null float64
Title                890 non-null int64
AgeUsingMeanTitle    890 non-null float64
AgeClass             890 non-null float64
SexClass             890 non-null int64
FamilySize           890 non-null int64
AgeSquared           890 non-null float64
AgeClassSquared      890 non-null float64
Survived             890 non-null int64
dtypes: float64(6), int64(7)
memory usage: 90.5 KB
------------------------------------
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 418 entries, 0 to 417
Data columns (total 13 columns):
PassengerId          418 non-null int64
Male                 418 non-null int64
Pclass               418 non-null int64
Fare                 418 non-null float64
FarePerPerson        418 non-null float64
Title                418 non-null int64
AgeUsingMeanTitle    418 non-null float64
AgeClass             418 non-null float64
SexClass             418 non-null int64
FamilySize           418 non-null int64
AgeSquared           418 non-null float64
AgeClassSquared      418 non-null float64
Survived             418 non-null int64
dtypes: float64(6), int64(7)
memory usage: 42.5 KB

Basic & Advanced machine learning tools

Agenda

  • What is machine learning?
  • What are the two main categories of machine learning?
  • What are some examples of machine learning?
  • How does machine learning "work"?

What is machine learning?

One definition: "Machine learning is the semi-automated extraction of knowledge from data"

  • Knowledge from data: Starts with a question that might be answerable using data
  • Automated extraction: A computer provides the insight
  • Semi-automated: Requires many smart decisions by a human

What are the two main categories of machine learning?

Supervised learning: Making predictions using data

  • Example: Is a given email "spam" or "ham"?
  • There is an outcome we are trying to predict

Unsupervised learning: Extracting structure from data

  • Example: Segment grocery store shoppers into clusters that exhibit similar behaviors
  • There is no "right answer"

How does machine learning "work"?

High-level steps of supervised learning:

  1. First, train a machine learning model using labeled data

    • "Labeled data" has been labeled with the outcome
    • "Machine learning model" learns the relationship between the attributes of the data and its outcome
  2. Then, make predictions on new data for which the label is unknown

The primary goal of supervised learning is to build a model that "generalizes": It accurately predicts the future rather than the past!

Questions about machine learning

  • How do I choose which attributes of my data to include in the model?
  • How do I choose which model to use?
  • How do I optimize this model for best performance?
  • How do I ensure that I'm building a model that will generalize to unseen data?
  • Can I estimate how well my model is likely to perform on unseen data?

Benefits and drawbacks of scikit-learn

Benefits:

  • Consistent interface to machine learning models
  • Provides many tuning parameters but with sensible defaults
  • Exceptional documentation
  • Rich set of functionality for companion tasks
  • Active community for development and support

Potential drawbacks:

  • Harder (than R) to get started with machine learning
  • Less emphasis (than R) on model interpretability

Further reading:


In [ ]:

Types of supervised learning

  • Classification: Predict a categorical response
  • Regression: Predict a ordered/continuous response

  • Note that each value we are predicting is the response (also known as: target, outcome, label, dependent variable)

Model evaluation metrics

  • Regression problems: Mean Absolute Error, Mean Squared Error, Root Mean Squared Error
  • Classification problems: Classification accuracy

Load Corpus


In [5]:
trn_corpus_after_preprocessing.columns


Out[5]:
Index(['PassengerId', 'Male', 'Pclass', 'Fare', 'FarePerPerson', 'Title',
       'AgeUsingMeanTitle', 'AgeClass', 'SexClass', 'FamilySize', 'AgeSquared',
       'AgeClassSquared', 'Survived'],
      dtype='object')

In [6]:
list_of_non_preditor_variables = ['Survived','PassengerId']

In [7]:
#Method 1
#x_train = trn_corpus_after_preprocessing.ix[:, trn_corpus_after_preprocessing.columns != 'Survived']
#y_train = trn_corpus_after_preprocessing.ix[:,"Survived"]

#Method 2
x_train = trn_corpus_after_preprocessing[trn_corpus_after_preprocessing.columns.difference(list_of_non_preditor_variables)].copy()
y_train = trn_corpus_after_preprocessing['Survived'].copy()
#y_train = trn_corpus_after_preprocessing.iloc[:,-1]
#y_train = trn_corpus_after_preprocessing[trn_corpus_after_preprocessing.columns[-1]]

#x_train

In [8]:
#y_train

In [9]:
x_train.columns


Out[9]:
Index(['AgeClass', 'AgeClassSquared', 'AgeSquared', 'AgeUsingMeanTitle',
       'FamilySize', 'Fare', 'FarePerPerson', 'Male', 'Pclass', 'SexClass',
       'Title'],
      dtype='object')

In [10]:
# check the types of the features and response
#print(type(x_train))
#print(type(x_test))

In [11]:
#Method 1
#x_test = tst_corpus_after_preprocessing.ix[:, trn_corpus_after_preprocessing.columns != 'Survived']
#y_test = tst_corpus_after_preprocessing.ix[:,"Survived"]

#Method 2
x_test = tst_corpus_after_preprocessing[tst_corpus_after_preprocessing.columns.difference(list_of_non_preditor_variables)].copy()
y_test = tst_corpus_after_preprocessing['Survived'].copy()
#y_test = tst_corpus_after_preprocessing.iloc[:,-1]
#y_test = tst_corpus_after_preprocessing[tst_corpus_after_preprocessing.columns[-1]]

In [12]:
#x_test

In [13]:
#y_test

In [14]:
# display the first 5 rows
x_train.head()


Out[14]:
AgeClass AgeClassSquared AgeSquared AgeUsingMeanTitle FamilySize Fare FarePerPerson Male Pclass SexClass Title
0 66.0 4356.0 484.0 22.0 1 7.2500 3.62500 1 3 3 3
1 38.0 1444.0 1444.0 38.0 1 71.2833 35.64165 0 1 0 3
2 78.0 6084.0 676.0 26.0 0 7.9250 7.92500 0 3 0 3
3 35.0 1225.0 1225.0 35.0 1 53.1000 26.55000 0 1 0 3
4 105.0 11025.0 1225.0 35.0 0 8.0500 8.05000 1 3 3 3

In [15]:
# display the last 5 rows
x_train.tail()


Out[15]:
AgeClass AgeClassSquared AgeSquared AgeUsingMeanTitle FamilySize Fare FarePerPerson Male Pclass SexClass Title
885 117.000000 13689.00000 1521.000000 39.000000 5 29.125 4.854167 0 3 0 3
886 54.000000 2916.00000 729.000000 27.000000 0 13.000 13.000000 1 2 2 0
887 19.000000 361.00000 361.000000 19.000000 0 30.000 30.000000 0 1 0 3
888 86.061263 7406.54097 822.948997 28.687088 3 23.450 5.862500 0 3 0 3
889 26.000000 676.00000 676.000000 26.000000 0 30.000 30.000000 1 1 1 3

In [16]:
# check the shape of the DataFrame (rows, columns)
x_train.shape


Out[16]:
(890, 11)

What are the features?

  • AgeClass:
  • AgeClassSquared:
  • AgeSquared:
  • ...

What is the response?

  • Survived: 1-Yes, 0-No

What else do we know?

  • Because the response variable is dicrete, this is a Classification problem.
  • There are 200 observations (represented by the rows), and each observation is a single market.

Note that if the response variable is continuous, this is a regression problem.


In [ ]:

Decision Trees Classification


In [17]:
from sklearn import tree

clf = tree.DecisionTreeClassifier()
clf = clf.fit(x_train, y_train)

In [18]:
#Once trained, we can export the tree in Graphviz format using the export_graphviz exporter. 
#Below is an example export of a tree trained on the entire iris dataset:
with open("output/titanic.dot", 'w') as f:
    f = tree.export_graphviz(clf, out_file=f)

#Then we can use Graphviz’s dot tool to create a PDF file (or any other supported file type): 
#dot -Tpdf titanic.dot -o titanic.pdf.
import os
os.unlink('output/titanic.dot')

#Alternatively, if we have Python module pydotplus installed, we can generate a PDF file 
#(or any other supported file type) directly in Python:
import pydotplus 
dot_data = tree.export_graphviz(clf, out_file=None) 
graph = pydotplus.graph_from_dot_data(dot_data) 
graph.write_pdf("output/titanic.pdf")


Out[18]:
True

In [19]:
#The export_graphviz exporter also supports a variety of aesthetic options, 
#including coloring nodes by their class (or value for regression) 
#and using explicit variable and class names if desired. 
#IPython notebooks can also render these plots inline using the Image() function:


"""from IPython.display import Image  
dot_data = tree.export_graphviz(clf, out_file=None, 
                         feature_names= list(x_train.columns[1:]), #iris.feature_names,  
                         class_names= ["Survived"], #iris.target_names,  
                         filled=True, rounded=True,  
                         special_characters=True)  
graph = pydotplus.graph_from_dot_data(dot_data)  
Image(graph.create_png())"""


Out[19]:
'from IPython.display import Image  \ndot_data = tree.export_graphviz(clf, out_file=None, \n                         feature_names= list(x_train.columns[1:]), #iris.feature_names,  \n                         class_names= ["Survived"], #iris.target_names,  \n                         filled=True, rounded=True,  \n                         special_characters=True)  \ngraph = pydotplus.graph_from_dot_data(dot_data)  \nImage(graph.create_png())'

In [20]:
print("accuracy score: ", clf.score(x_test,y_test))


accuracy score:  0.775119617225

Classification accuracy: percentage of correct predictions


In [21]:
#After being fitted, the model can then be used to predict the class of samples:
y_pred_class = clf.predict(x_test);

#Alternatively, the probability of each class can be predicted, 
#which is the fraction of training samples of the same class in a leaf:
clf.predict_proba(x_test);

In [22]:
# calculate accuracy
from sklearn import metrics

print(metrics.accuracy_score(y_test, y_pred_class))


0.775119617225

Null accuracy: accuracy that could be achieved by always predicting the most frequent class


In [23]:
# examine the class distribution of the testing set (using a Pandas Series method)
y_test.value_counts()


Out[23]:
0    266
1    152
Name: Survived, dtype: int64

In [24]:
# calculate the percentage of ones
y_test.mean()


Out[24]:
0.36363636363636365

In [25]:
# calculate the percentage of zeros
1 - y_test.mean()


Out[25]:
0.63636363636363635

In [26]:
# calculate null accuracy (for binary classification problems coded as 0/1)
max(y_test.mean(), 1 - y_test.mean())


Out[26]:
0.63636363636363635

In [27]:
# calculate null accuracy (for multi-class classification problems)
y_test.value_counts().head(1) / len(y_test)


Out[27]:
0    0.636364
Name: Survived, dtype: float64

Comparing the true and predicted response values


In [28]:
# print the first 25 true and predicted responses
from __future__ import print_function
print('True:', y_test.values[0:25])
print('Pred:', y_pred_class[0:25])


True: [0 1 0 0 1 0 1 0 1 0 0 0 1 0 1 1 0 0 1 1 0 0 1 0 1]
Pred: [0 0 1 1 1 0 0 0 1 0 0 0 1 0 1 1 0 1 0 0 0 1 1 0 1]

Conclusion: ???

  • Classification accuracy is the easiest classification metric to understand
  • But, it does not tell you the underlying distribution of response values
  • And, it does not tell you what "types" of errors your classifier is making

Confusion matrix

Table that describes the performance of a classification model


In [29]:
# IMPORTANT: first argument is true values, second argument is predicted values
print(metrics.confusion_matrix(y_test, y_pred_class))


[[214  52]
 [ 42 110]]

Basic terminology

  • True Positives (TP): we correctly predicted that they do have diabetes
  • True Negatives (TN): we correctly predicted that they don't have diabetes
  • False Positives (FP): we incorrectly predicted that they do have diabetes (a "Type I error")
  • False Negatives (FN): we incorrectly predicted that they don't have diabetes (a "Type II error")

In [30]:
# save confusion matrix and slice into four pieces
confusion = metrics.confusion_matrix(y_test, y_pred_class)
TP = confusion[1, 1]
TN = confusion[0, 0]
FP = confusion[0, 1]
FN = confusion[1, 0]

In [31]:
print(TP, TN, FP, FN)


110 214 52 42

Metrics computed from a confusion matrix

Classification Accuracy: Overall, how often is the classifier correct?


In [32]:
print((TP + TN) / float(TP + TN + FP + FN))
print(metrics.accuracy_score(y_test, y_pred_class))


0.775119617225
0.775119617225

Classification Error: Overall, how often is the classifier incorrect?

  • Also known as "Misclassification Rate"

In [33]:
print((FP + FN) / float(TP + TN + FP + FN))
print(1 - metrics.accuracy_score(y_test, y_pred_class))


0.224880382775
0.224880382775

Specificity: When the actual value is negative, how often is the prediction correct?

  • How "specific" (or "selective") is the classifier in predicting positive instances?

In [34]:
print(TN / float(TN + FP))


0.804511278195

False Positive Rate: When the actual value is negative, how often is the prediction incorrect?


In [35]:
print(FP / float(TN + FP))


0.195488721805

Precision: When a positive value is predicted, how often is the prediction correct?

  • How "precise" is the classifier when predicting positive instances?

In [36]:
print(TP / float(TP + FP))
print(metrics.precision_score(y_test, y_pred_class))


0.679012345679
0.679012345679

In [37]:
print("Presicion: ", metrics.precision_score(y_test, y_pred_class))
print("Recall: ", metrics.recall_score(y_test, y_pred_class))
print("F1 score: ", metrics.f1_score(y_test, y_pred_class))


Presicion:  0.679012345679
Recall:  0.723684210526
F1 score:  0.700636942675

Many other metrics can be computed: F1 score, Matthews correlation coefficient, etc.

Conclusion:

  • Confusion matrix gives you a more complete picture of how your classifier is performing
  • Also allows you to compute various classification metrics, and these metrics can guide your model selection

Which metrics should you focus on?

  • Choice of metric depends on your business objective
  • Spam filter (positive class is "spam"): Optimize for precision or specificity because false negatives (spam goes to the inbox) are more acceptable than false positives (non-spam is caught by the spam filter)
  • Fraudulent transaction detector (positive class is "fraud"): Optimize for sensitivity because false positives (normal transactions that are flagged as possible fraud) are more acceptable than false negatives (fraudulent transactions that are not detected)

Support Vector Machine (SVM)

Linear Support Vector Classification.

Similar to SVC with parameter kernel=’linear’, but implemented in terms of liblinear rather than libsvm, so it has more flexibility in the choice of penalties and loss functions and should scale better to large numbers of samples.

Ref: http://scikit-learn.org/stable/modules/generated/sklearn.svm.LinearSVC.html#sklearn.svm.LinearSVC


In [38]:
from sklearn import svm

model = svm.LinearSVC()

model.fit(x_train, y_train)


Out[38]:
LinearSVC(C=1.0, class_weight=None, dual=True, fit_intercept=True,
     intercept_scaling=1, loss='squared_hinge', max_iter=1000,
     multi_class='ovr', penalty='l2', random_state=None, tol=0.0001,
     verbose=0)

In [39]:
acc_score = model.score(x_test, y_test)

print("Accuracy score: ", acc_score)


Accuracy score:  0.593301435407

In [40]:
y_pred_class = model.predict(x_test)

In [41]:
from sklearn import metrics

In [42]:
confusion_matrix = metrics.confusion_matrix(y_test, y_pred_class)

print(confusion_matrix)


[[219  47]
 [123  29]]

Classifier comparison

http://scikit-learn.org/stable/auto_examples/classification/plot_classifier_comparison.html

A comparison of a several classifiers in scikit-learn on synthetic datasets. The point of this example is to illustrate the nature of decision boundaries of different classifiers. This should be taken with a grain of salt, as the intuition conveyed by these examples does not necessarily carry over to real datasets.

Particularly in high-dimensional spaces, data can more easily be separated linearly and the simplicity of classifiers such as naive Bayes and linear SVMs might lead to better generalization than is achieved by other classifiers.

The plots show training points in solid colors and testing points semi-transparent. The lower right shows the classification accuracy on the test set.


In [43]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.gaussian_process.kernels import RBF
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis

from sklearn.datasets import make_classification
from sklearn.preprocessing import StandardScaler
from matplotlib.colors import ListedColormap

In [ ]:


In [44]:
#classifiers

In [45]:
#x_train

In [46]:
#sns.pairplot(x_train)

In [47]:
x_train_scaled = StandardScaler().fit_transform(x_train)

x_test_scaled = StandardScaler().fit_transform(x_test)

In [48]:
x_train_scaled[0]


Out[48]:
array([ 0.02743953, -0.187067  , -0.64095047, -0.59652571,  0.05850706,
       -0.50278454, -0.4549534 ,  0.73833521,  0.82816049,  1.10507752,
        0.1608944 ])

In [49]:
len(x_train_scaled[0])


Out[49]:
11

In [50]:
df_x_train_scaled = pd.DataFrame(columns=x_train.columns, data=x_train_scaled)

In [51]:
df_x_train_scaled.head()


Out[51]:
AgeClass AgeClassSquared AgeSquared AgeUsingMeanTitle FamilySize Fare FarePerPerson Male Pclass SexClass Title
0 0.027440 -0.187067 -0.640950 -0.596526 0.058507 -0.502785 -0.454953 0.738335 0.828160 1.105078 0.160894
1 -0.820101 -0.747159 0.436930 0.633468 0.058507 0.785958 0.438395 -1.354398 -1.564901 -1.175106 0.160894
2 0.390671 0.145295 -0.425374 -0.289027 -0.561389 -0.489199 -0.334972 -1.354398 0.828160 -1.175106 0.160894
3 -0.910909 -0.789281 0.191038 0.402844 0.058507 0.419998 0.184714 -1.354398 -1.564901 -1.175106 0.160894
4 1.207943 1.095643 0.191038 0.402844 -0.561389 -0.486684 -0.331484 0.738335 0.828160 1.105078 0.160894

In [52]:
#sns.pairplot(df_x_train_scaled)

In [53]:
names = ["Nearest Neighbors", "Linear SVM", "RBF SVM",
         "Decision Tree", "Random Forest", "Neural Net", "AdaBoost",
         "Naive Bayes", "QDA", "Gaussian Process"]

classifiers = [
    KNeighborsClassifier(3),
    SVC(kernel="linear", C=0.025),
    SVC(gamma=2, C=1),
    DecisionTreeClassifier(max_depth=5),
    RandomForestClassifier(max_depth=5, n_estimators=10, max_features=1),
    MLPClassifier(alpha=1),
    AdaBoostClassifier(),
    GaussianNB(),
    QuadraticDiscriminantAnalysis()
    #, GaussianProcessClassifier(1.0 * RBF(1.0), warm_start=True), # Take too long...
    ]

# iterate over classifiers
for name, model in zip(names, classifiers):
    model.fit(x_train_scaled, y_train)
    acc_score = model.score(x_test_scaled, y_test)
    print(name, " - accuracy score: ", acc_score)
#end for


Nearest Neighbors  - accuracy score:  0.777511961722
Linear SVM  - accuracy score:  1.0
RBF SVM  - accuracy score:  0.877990430622
Decision Tree  - accuracy score:  0.937799043062
Random Forest  - accuracy score:  0.856459330144
Neural Net  - accuracy score:  0.911483253589
AdaBoost  - accuracy score:  0.88038277512
Naive Bayes  - accuracy score:  0.827751196172
QDA  - accuracy score:  0.777511961722

In [ ]:


In [ ]:


In [ ]:


In [ ]:


In [ ]:


In [ ]:


In [ ]:


In [ ]:


In [ ]:

Decision Tree Regressor

Ref: http://scikit-learn.org/stable/modules/tree.html

Decision trees can also be applied to regression problems, using the DecisionTreeRegressor class.

As in the classification setting, the fit method will take as argument arrays X and y, only that in this case y is expected to have floating point values instead of integer values:


In [54]:
from sklearn import tree


clf = tree.DecisionTreeRegressor()
clf = clf.fit(x_train, y_train)

clf.score(x_test,y_test)


Out[54]:
0.018129242376775268

In [55]:
#clf.predict(x_test)

Random Forests


In [ ]:


In [ ]:

Naive Bayes


In [ ]:

Simple Linear Regression

Recall that Simple Linear Regression is given by the following equation: $y = \alpha + \beta x$

Our goal is to solve the values $\alpha$ and $\beta$ that minimize the cost function.

$$\beta = \frac{cov(x,y)}{var(x)}$$

where $cov(x,y)$ denotes a measure of how far a set of values is spread out.

Note that:

  • Variance is zero if all of the features are spread out equally.
  • A SMALL variance indicates that the numbers are NEAR the mean of the set
  • A LARGE variance when the numbers are FAR the mean of the set
$$var(x) = \frac{\sum\limits_{i=1}^{n}{\left( x_i - \overline{x} \right)}}{n-1}$$$$cov(x,y) = \frac{\sum\limits_{i=1}^{n}{\left( x_i - \overline{x} \right)\left( y_i - \overline{y} \right)}}{n-1}$$

Having solved $\beta$, we can estimate $\alpha$ using the following formula: $$\alpha = \overline{y} - \beta \overline{x}$$

Evaluating the Model

Using r-squared - that measures how well the observed values of the response variables are predicted by the model. In the case of simple linear regression, r-squared is equal to Pearson's r. In this method, r-squared must be a positive number between zero and one. In others, r-squared can return a negative number if the model performs extremely poorly.


In [56]:
from sklearn.linear_model import LinearRegression

In [57]:
model = LinearRegression()

model.fit(x_train, y_train)

r_squared = model.score(x_test, y_test)

print("R-squared: %.4f" %r_squared)


R-squared: 0.6787

Multiple Linear Regresssion

Formally, multiple linear regression is the following model:

$$y = \alpha+\beta_1x_1+\beta_2x_2+...+\beta_nx_n$$

or

$$Y = X\beta$$

where $Y$ denotes a column vector of the values of the response variables for training, $\beta$ denotes a column vector of the values of the model's parameters, $X$ is called the design matrix, an $m \times n$ dimensional matrix of the values of the features.

We can solve $\beta$ as follows:

$$\beta = \left( X^TX \right)^{-1}X^TY$$

Note that - code python:

from numpy import dot, transpose
beta = dot(inv(dot(transpose(X),X)), dot(transpose(X), Y))

In [58]:
from sklearn.linear_model import LinearRegression

In [59]:
model = LinearRegression()

model.fit(x_train, y_train)

predictions = model.predict(x_test)

In [60]:
#for i in range(predictions.size):
#    print("Predicted: %.2f, Target: %.2f" %(predictions[i], y_test[i]))

r_squared = model.score(x_test, y_test)
    
print("R-squared: %.4f" %r_squared)


R-squared: 0.6787

Polynomialy Regression

Quadratic Regression, regession with a second order polynomial, is given by the following formula:

$$y = \alpha +\beta_1x^1+\beta_2x^2$$

In [61]:
import numpy as np
import matplotlib.pyplot as plt

from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

In [62]:
model = LinearRegression()

model.fit(x_train, y_train)


Out[62]:
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)

In [63]:
xx = np.linspace(0, 26, 100)
#yy = np.linspace(0, 26, 100)

#yy = model.predict(xx.reshape(xx.shape[0],1))

#plt.plot(xx, yy)

In [64]:
quadratic_featurizer = PolynomialFeatures(degree=2)

x_train_quadratic = quadratic_featurizer.fit_transform(x_train)
x_test_quadratic = quadratic_featurizer.fit(x_test)

In [65]:
x_train.head()


Out[65]:
AgeClass AgeClassSquared AgeSquared AgeUsingMeanTitle FamilySize Fare FarePerPerson Male Pclass SexClass Title
0 66.0 4356.0 484.0 22.0 1 7.2500 3.62500 1 3 3 3
1 38.0 1444.0 1444.0 38.0 1 71.2833 35.64165 0 1 0 3
2 78.0 6084.0 676.0 26.0 0 7.9250 7.92500 0 3 0 3
3 35.0 1225.0 1225.0 35.0 1 53.1000 26.55000 0 1 0 3
4 105.0 11025.0 1225.0 35.0 0 8.0500 8.05000 1 3 3 3

In [66]:
model_quadratic = LinearRegression()

model_quadratic.fit(x_train_quadratic, y_train)

#predictions = model_quadratic.predict(x_test_quadratic)

#r_squared = model_quadratic.score(x_test_quadratic, y_test)

#r_squared
    
#print("R-squared: %.4f" %r_squared)


Out[66]:
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)

In [ ]:


In [ ]:


In [ ]:


In [ ]:

Linear Regression 2


In [ ]:

Logistic Regression


In [ ]:

SVM


In [ ]:

KNN (K- Nearest Neighbors)


In [ ]:


In [ ]:


In [ ]: